ABSTRACT

Problems associated with criterion-referenced
language testing are discussed in the context of both standardized
proficiency testing and classroom assessment. First, different
interpretations of criterion-referencing are examined. A range of
approaches for defining criteria and performance levels in second 
language assessment are outlined, and some issues that have arisen in 
defining and applying these criteria are discussed, including the 
difficulties of defining the nature of proficiency and the failure of 
expert judges to agree on criteria. Finally, a discussion is given of 
research directions that might lead to language assessment criteria 
that incorporate multiple perspectives on learners' communicative
needs and derive from empirical data on second language acquisition,
variability in language use, and communicative competence. A 73-item 
bibliography is included. (MSE) 

Geoff Brindley
INTRODUCTION 



In recent years, there has been a move towards the wider use of
criterion-referenced (CR) methods of assessing second language ability which allow
learners' language performance to be described and judged in relation to defined 
behavioural criteria. This is in line with the concern among language testers to 
provide meaningful information about what testees are able to do with the 
language rather than merely providing test scores. However, while criterion- 
referencing has enabled language testers to be more explicit about what is being 
assessed, there are numerous problems associated with the development, 
interpretation and use of CR methods of assessment, to such an extent that the 
feasibility of true criterion-referencing has been questioned by some writers (eg. 
Skehan 1984, 1989). 

This paper aims to illustrate and discuss the nature of these problems in the 
context of both standardized proficiency testing and classroom assessment.
First, different interpretations of "criterion-referencing" will be examined. 
Following this, a range of approaches to defining criteria and performance levels
in second language assessment will be outlined and some of the issues which
have arisen in defining and applying these criteria will be discussed, including the
difficulties of defining the nature of "proficiency" and the failure of expert judges
to agree on criteria. Finally, research directions will be indicated that might lead
to language assessment criteria which incorporate multiple perspectives on 
learners' communicative needs and which derive from empirical data on second 
language acquisition and use. 



CRITERION-REFERENCING

The term "criterion-referenced" has been interpreted in a variety of ways in
both general education and language learning. In their original formulation of
the concept, Glaser and Klaus (1962: 422), in the context of proficiency
measurement in military and industrial training, stated that

knowledge of an individual's score on a criterion-referenced measure provides
explicit information as to what the individual can or cannot do

Glaser (1963) described criterion-referenced assessment (CRA) thus:

The degree to which his achievement resembles desired performance at any
specified level is assessed by criterion-referenced measures of achievement or
proficiency. The standard against which a student's performance is compared
when measured in this manner is the behaviour which defines each point
along the achievement continuum. The term 'criterion', when used this way,
does not necessarily refer to final end-of-course behaviour. Criterion levels
can be established at any point in instruction as to the adequacy of an
individual's performance. The point is that the specific behaviours implied at
each level of proficiency can be identified and used to describe the specific
tasks a student must be capable of performing before he achieves each of
these knowledge levels. It is in this sense that measures of proficiency can be
criterion-referenced.

This early definition of CRA highlights several key elements which are
reflected in various kinds of language assessment instruments: first, proficiency
(here, interestingly, not distinguished very clearly from achievement) is
conceived of as a continuum ranging from no proficiency at all to "perfect"
proficiency; second, the criterion is defined as an external standard against which
learner behaviour is compared; and third, levels of proficiency (or achievement)
are linked to specific tasks.



CRITERION REFERENCING IN LANGUAGE ASSESSMENT

In the context of language learning, CRA has a number of different meanings
(Skehan 1989: 5-6). In the first instance, it refers in a general sense to tests or
assessments which are based on sampling of a behavioural domain and which
make explicit the features of this domain. For example, in an oral interview, a
testee might be given a score on a rating scale which contains the key aspects of
performance (that is, the criteria) to be assessed such as fluency, appropriacy,
accuracy, pronunciation, grammar etc. These criteria may then be described
more fully in a band or level description. As Skehan (1984: 217) notes, such
descriptions represent a set of generalised behaviours relating performance to
external criteria (referred to by Jones 1985: 82 as performance criterion),
rather than a statement that would enable a yes/no decision to be made with
respect to a testee's ability on a particular task.

As we have seen, CRA also carries a second meaning of a standard
(criterion level) or cut-off point which may be defined with reference to some
external requirement. In the context of language assessment, this might be
exemplified by the "threshold level" set by the Council of Europe as a minimal
level of functional language competence. Some writers, in fact, posit the
existence of a constant and "natural reference point" for this external standard in
the form of the native speaker (see, for example, Cziko 1983: 294).

Skehan (1989) also suggests a third sense in which CRA can be interpreted:

This is that the proficiency levels which are the basis for criterion-referencing
are linked in some cumulative way to a course of development.

This raises the issue of whether assessment criteria should take as their
reference point what learners do, what linguists and teachers think learners do 
or what native speakers do. This point will be taken up later. 



NORM-REFERENCING VERSUS CRITERION-REFERENCING 

CRA is traditionally contrasted with norm-referenced methods of 
assessment which are meant to compare individuals' performances relative to
each other and to distribute them along the normal curve, not to establish the
degree to which students have mastered a particular skill (Hudson and Lynch
1984: 172). Large-scale standardized examinations, in which students are given
aggregate scores or grades for purposes of selection, certification or placement
are probably the best-known example of norm-referenced assessment. An
example of a norm-referenced approach from second language learning would 
be proficiency test batteries in which results are reported solely in terms of an 
overall score (a range of such tests is described by Alderson, Krahnke and 
Stansfield 1987). 
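
The contrast between the two interpretations can be made concrete with a minimal sketch. It is an illustration only: the cohort scores, the cut score of 60 and the function names `percentile_rank` and `meets_criterion` are all hypothetical and are not taken from any of the tests cited here. The same raw score is read first against the other candidates (norm-referenced) and then against a fixed external standard (criterion-referenced).

```python
from bisect import bisect_left


def percentile_rank(score, cohort_scores):
    """Norm-referenced reading: percentage of the cohort scoring strictly below `score`."""
    ordered = sorted(cohort_scores)
    below = bisect_left(ordered, score)  # number of scores strictly lower than `score`
    return 100.0 * below / len(ordered)


def meets_criterion(score, cut_score):
    """Criterion-referenced reading: compare the score with a fixed external standard."""
    return score >= cut_score


# Hypothetical data: five candidates and an assumed cut score.
cohort = [35, 42, 48, 50, 55]
CUT_SCORE = 60

best = max(cohort)
print(percentile_rank(best, cohort))     # 80.0 (top of this cohort)
print(meets_criterion(best, CUT_SCORE))  # False (still below the criterion level)
```

In this invented cohort the strongest candidate ranks highest under the norm-referenced reading yet fails the criterion-referenced one, which is the distinction the paragraph above draws.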

According to some authors, however, the differences between norm-referenced
assessment and CRA are not as great as conventionally
imagined. Rowntree (1987: 185-6), for example, notes that criterion levels are
frequently established by using population norms:

So much assessment that appears to be criterion-referenced is, in a sense,
norm-referenced. The difference is that the student's performance is judged
and labelled by comparison with the norms established by other students
elsewhere rather than those established by his immediate fellow-students.

There is an element of both norm- and criterion-referencing about the way in
which proficiency descriptions are drawn up and interpreted. For example, one
method of defining assessment criteria and performance descriptors for writing
proficiency is to ask experienced teachers, without the aid of any explicit criteria,
to rank learners in order of proficiency by sorting a set of writing scripts into
piles representing clearly definable proficiency differences. Following this, the
characteristic features of scripts at each level are discussed and these are then
used to establish criteria and performance descriptors.

The level descriptions in proficiency scales, as numerous authors have
pointed out (eg. Trim 1977, Skehan 1984), often contain norm-referenced
terminology despite their claim to be criterion-referenced. Terminology such as
"greater flexibility" or "fewer errors" relates the levels to each other instead of to
the external standard which is supposed to characterise criterion-referencing. In
terms of their actual use, as well, the descriptors may be interpreted in covertly 
norm-referenced ways. It is not unusual, for example, to hear teachers refer to a 
"good Level 1", a "slow Level 2" etc. 



DEVELOPING CRITERIA AND DESCRIBING PERFORMANCE
Real world and classroom dimensions of CRA 

CRA has both a real-world and a classroom dimension. In the 
development of a proficiency test aimed at assessing real-world language use, 
defining criteria involves operationalising the construct of proficiency - in other
words, specifying the skills and abilities which constitute the test developer's view
of "what it means to know how to use a language" (Spolsky 1986). From the test 
specifications thus established, items are constructed and/or level/band 
descriptions written according to which performance will be rated. This is, of 
necessity, a time-consuming and rigorous process involving empirical studies of 
performance samples, consultation with expert judges and continuing revision of 
criteria and descriptors (see, for example, the descriptions by Alderson (1989) 
and Westaway (1988) of the way in which IELTS bands were derived).

In classroom CRA which is aimed at assessing learner achievement or 
diagnosing difficulties, the process of defining criteria and descriptors involves 
specifying the behavioural domain from which objectives are drawn, formulating 
a set of relevant objectives and establishing a set of standards by which learners' 
performance is judged. In many ways, this process replicates what is involved in
operationalising the construct of proficiency, in that it involves specifying the
nature of the domain to be assessed and breaking this down into its component
parts. However, classroom CRA is likely to be less formal and may rely on
implicit judgements on the teacher's part as to what constitutes the domain of
ability which is assessed (Black and Dockrell 1984: 42-43).

It is worth noting at this point that the interpretation of "criterion" is slightly
different, according to the purposes for which CRA is being carried out. Where
learners' proficiency is being assessed in order to determine their capacity to
undertake some real-world activity (eg. to exercise a profession),
criterion-referenced is often taken to mean that their performance is compared against a
"criterion level" of performance or a cut-score. They either reach the criterion or
they don't. As Davies (1988: 33) notes, users of tests interpret all test results in
a criterion-referenced way. A candidate's actual score is of less importance than
the question: has the candidate attained the cut score or not?

In the classroom, however, the emphasis is slightly different. Here, the
"criterion" against which learners' performance is assessed relates to a domain
specification and a set of learning objectives. Attainment may be assessed in
terms of mastery/non-mastery of these objectives (see, for example, Hudson and
Lynch 1984; Hudson 1989). However, making a yes/no decision on whether
mastery has been attained can be extremely difficult. In fact, the validity of the
concept itself has been questioned (Glass 1978) and there is a multiplicity of
competing views on appropriate standard-setting methods in CRA (see Berk
1986 for a comprehensive discussion of the relative merits of various methods).
For this reason, classroom CRA is often more concerned with assessing learners'
attainment on a scale of ability which represents varying degrees of mastery but
is not necessarily linked to a "cut-score" (see Brindley 1989 for examples).

In terms of content, CR proficiency testing tends to focus on assessing tasks
which replicate real life or from which inferences can be made to real-life
performance. As far as classroom assessment is concerned, however, opinions
differ on the question of whether CRA should be exclusively focussed on
subsequent extra-classroom tasks or whether any valid objective can be assessed
(Brown 1981: 7). If the latter view is accepted, then it would be possible to
imagine situations in which CRA did not concern itself with elements
of learners' communicative performance (eg. if the syllabus were
grammatically-based). CRA does not, in other words, necessarily mean communicative
assessment. However, in the case of second language learners who have to use
the language in society on a daily basis there are clearly arguments for
accentuating methods of CRA which allow them to gain feedback on their ability
to perform real-life tasks (see Brindley 1989: 91-120 for examples).





A variety of methods have been used by test developers and teachers to
define assessment criteria and performance descriptors. These will be described 
below and some problems associated with each will be discussed. 



Use existing criteria 

The easiest way to define criteria and descriptors for language assessment is
to use those already in existence. There is no shortage of models and examples. 
For proficiency testing, literally thousands of rating scales, band scales and 
performance descriptors are used throughout the world. An equivalent number 
of skills taxonomies, competency checklists, objectives grids etc, are available for 
classroom use. 

Like tests, some proficiency scales seem to have acquired popular validation 
by virtue of their longevity and extracts from them regularly appear in other
scales. The original scale used in conjunction with the Foreign Service Institute 
Oral Interview (FSI 1968), in particular, seems to have served as a source of 
inspiration for a wide range of other instruments with a similar purpose but not 
necessarily with a similar target group. Both the Australian Second Language 
Proficiency Rating Scale (ASLPR) (Ingram 1984) and the ACTFL Proficiency 
Guidelines (Hiple 1987) which aim to describe in the first case the proficiency of
adult immigrants in Australia and in the second the proficiency of foreign
language students and teachers in the USA, draw on the FSI descriptions.



Problems 

Although proficiency scales have gained widespread acceptance over a 
considerable period of time and appear face-valid, it is very difficult to find any
explicit information on how the descriptions were actually arrived at. Although
some scales are claimed to be data-based (see, for example, Liskin-Gasparro
(1984: 37) who states that the ACTFL guidelines were developed empirically),
no information is made publicly available as to how the data were collected,
analysed and turned into performance descriptors. This is despite the fact that
in some cases claims are being made (if only by inference) to the effect that the
descriptions constitute universal descriptions of second language development.
Byrnes (1987), for example, claims that the ACTFL/ETS scale is built on a
"hierarchy of task universals".

Apart from their lack of empirical underpinning, the validity of rating scale
descriptors (in particular the ACTFL/ETS Oral Proficiency Interview) has been
contested on a number of other grounds. Some of the principal concerns which
have been voiced can be roughly summarised as follows:

the logic of the way levels are arrived at is essentially circular - 'the criteria
are the levels and vice versa' (Lantolf and Frawley 1985: 340). They cannot
therefore be criterion-referenced in the accepted sense since there is no
external standard against which the testee's behaviour may be compared.

the incremental and lockstep nature of level descriptions fails to take into
account the well documented variability and "backsliding" which occur in
interlanguage (Pienemann, Johnston and Brindley 1988); nor can
differential abilities in different "discourse domains" be accounted for (see
Douglas and Selinker 1985, Zuengler 1989). In particular, the assumption
that grammatical and phonological accuracy increases in a linear fashion is
contradicted by evidence from second language acquisition studies which
have shown systematic variability according to the learner's
psycho-sociological orientation (Meisel et al. 1981); emotional investment in the
topic (Eisenstein and Starbuck 1989); the discourse demands of the task
(Brown and Yule 1989); desired degree of social convergence/divergence
(Rampton 1987); planning time available (Ellis 1987); and ethnicity and
status of interlocutor (Beebe 1983).

not only are the performance descriptions covertly norm-referenced (see
above), but also there is no principled relationship between co-occurring 
performance features which figure in the one level (Skehan 1984, Brindley 
1986). 

it is very difficult to specify relative degrees of mastery of a particular skill
with sufficient precision to distinguish clearly between levels. This is 
illustrated by Alderson's (1989: 11) comment on the development of the
IELTS Speaking scales:

For some criteria, for example pronunciation or grammatical accuracy, the
difference in levels came down to a different choice of quantifiers and we were
faced with issues like is 'some' more than 'a few' but fewer than 'several' or
'considerable' or 'many'. How many is 'many'?

the essentially interactive nature of oral communication is inadequately
represented due to the restriction of the possible range of roles which can
be assumed by the non-native speaker (Lantolf and Frawley 1988;
Raffaldini 1988; van Lier 1989).

the descriptions are highly context dependent and thus do not permit
generalisation about underlying ability (Bachman and Savignon 1986;
Skehan 1989). Methods such as the oral interview confuse trait and method
(Bachman 1988).

in the absence of concrete upper and lower reference points,
criterion-referencing is not possible. Bachman (1989: 17) points out that
criterion-referencing requires the definition of the end points of an absolute scale of
ability (so-called "zero" and "perfect" proficiency). Yet in practice, no-one
has zero proficiency, since some language abilities are universal. Similarly,
native speakers vary widely in ability, which makes the "perfect speaker" an
equally tenuous concept.

Clearly the validity of the criteria on which proficiency descriptions are built
is by no means universally accepted. However, the controversy surrounding the
construct validity of proficiency rating scales and performance descriptors is
merely a manifestation of the fundamental question that CRA has to face: how
to define the domain of ability which is to be assessed, that is, language
proficiency? Criterion-referencing depends on a very detailed and exact
specification of the behavioural domain. But this amounts to asking the question
posed by Spolsky (1986):



What does it mean to know how to use a language?

As far as proficiency testing is concerned, a definitive answer to this
question is clearly not presently on the horizon, although detailed and testable
models such as that proposed by Bachman (1990) offer some hope of describing
more exactly the nature of communicative language ability. Meanwhile, in the
context of classroom assessment, the move towards criterion-referencing
continues. There is an increasing number of objectives-based assessment and
profiling schemes derived from specification of real life communicative needs
which allow cumulative attainment to be monitored and documented in the form
of profiles of achievement (see Brindley 1989: 91-111). These present a way of
linking classroom assessment closely to real-world outcomes. However,
objectives-based domain specifications also require the operationalization of the
behaviour which forms the basis of the domain. As such, they are open to
question on the same grounds as the proficiency descriptions described above.
In addition, some testers would claim that performance testing associated with
assessment of course objectives gives no information on underlying ability
(Skehan 1989: 7).

The problem of domain specification is clearly far from being resolved. In
the meantime, disagreement on the validity of criteria will no doubt continue,
since there is as yet no description of language learning and language use on the
basis of which universally agreed criteria could be drawn up.



Attacking the domain specification problem 

Because of the limitations of context dependent proficiency descriptions
and the difficulties of relating these to an 'absolute' scale of ability, Bachman
(1989) argues that the only way to develop adequate CR procedures for
assessing communicative language proficiency is to attempt to clearly specify the
abilities that make up language proficiency and to define scales or levels of
proficiency which are independent of particular contexts, 'in terms of the relative
presence or absence of the abilities that constitute the domain' rather than 'in
terms of actual individuals or actual performance' (Bachman 1989: 256). An
example of such a scale is given below.

     Vocabulary                           Cohesion

0    Extremely limited vocabulary         No cohesion
     (A few words and formulaic           (Utterances completely disjointed,
     phrases. Not possible to             or discourse too short to judge).
     discuss any topic, due to
     limited vocabulary).

1    Small vocabulary                     Very little cohesion
     (Difficulty in talking with          (Relationships between utterances
     examinee because of                  not adequately marked; frequent
     vocabulary limitations).             confusing relationships among ideas).

2    Vocabulary of moderate size          Moderate cohesion
     (Frequently misses or searches       (Relationships between utterances
     for words).                          generally marked; sometimes
                                          confusing relationships
                                          among ideas).

3    Large vocabulary                     Good cohesion
     (Seldom misses or searches           (Relationships between utterances
     for words).                          well-marked).

4    Extensive vocabulary                 Excellent cohesion
     (Rarely, if ever, misses or          (Uses a variety of appropriate
     searches for words. Almost           devices; hardly ever confusing
     always uses appropriate word).       relationships among ideas).

Figure 1 Scales of ability in vocabulary and cohesion (Bachman and
Palmer, 1983)

However such scales, too, are clearly fraught with problems as Bachman
and Savignon (1986: 388) recognize when they admit the difficulty of 'specifying
the degree of control and range in terms that are specific enough to distinguish
levels clearly and for raters to interpret consistently'. The sample scales, in fact,
manifest many of the same problems which arise in the design of more
conventional proficiency rating scales. The terminology used is very imprecise
and relativistic ('limited'; 'frequently'; 'confusing' etc) and in the absence of
precise examples of learners' language use at each of the levels, problems of
rater agreement would inevitably arise. In fact, since the levels do not specify
particular contexts, structure, functions and so on, raters would not have any
concrete criteria to guide them. The difficulties of reaching agreement between
raters would, consequently, be likely to be even more acute.
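
One way to make "rater agreement" on such a scale measurable is sketched below. This is an assumption-laden illustration, not a procedure described in this paper or by Bachman and Savignon: it uses Cohen's kappa, a standard chance-corrected agreement statistic brought in here for the purpose, and the two sets of ratings on the 0-4 scale are invented.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters scoring the same performances."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of performances")
    n = len(ratings_a)
    # Proportion of performances given an identical level by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's own distribution of levels.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical level throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of eight performances on a 0-4 vocabulary scale.
rater_a = [0, 1, 2, 2, 3, 4, 1, 3]
rater_b = [0, 1, 1, 2, 3, 3, 2, 3]
print(round(cohens_kappa(rater_a, rater_b), 2))
```

A value near 1 would indicate that the level descriptions are being interpreted consistently, while a value near 0 would indicate agreement no better than chance. Kappa treats the levels as unordered categories, so a weighted variant may be preferable where near-misses between adjacent levels should count as partial agreement.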



Consult expert judges 

Another commonly used way of producing criteria for proficiency testing is
to ask expert judges to identify and sometimes to weight the key features of
learner performance which are to be assessed. Experienced teachers tend to be
the audience most frequently consulted in the development and refining of
criteria and performance descriptions (eg. Westaway 1988; Alderson 1989;
Griffin 1989). In some cases they may be asked to generate the descriptors
themselves by describing key indicators of performance at different levels of
proficiency. In others, test developers may solicit comments and suggestions
from teachers for modification of existing descriptors on the basis of their
knowledge and experience.

In ESP testing, test users may also be surveyed in order to establish patterns of
language usage and difficulty, including the relative importance of language tasks
and skills. The survey results then serve as a basis for test specifications. This
procedure has been followed in the development of tests of English for academic
purposes by, inter alia, Powers (1986), Hughes (1988) and Weir (1983, 1988) and
by McNamara (1989) in the construction of tests of speaking and writing for
overseas-trained health professionals in Australia.



Problems 

Who are the experts?

The idea of using "expert judgement" appeals to logic and common sense.
However it poses the question of who the experts actually are. Conventionally it
is teachers who provide "expert" judgements, although increasingly other
non-teacher test users are being involved in test development. There are obvious
reasons, of course, for appealing to teacher judgements. They are not difficult to
obtain since teachers are on hand, they are familiar with learners' needs and
problems, they are able to analyse language and they can usually be assumed to
be aware of the purposes and principles of language testing, even though they
may not always be sympathetic to it. Although less obviously "expert" in the
sense of being further removed from the language learning situation and less
familiar with linguistic terminology, test users who interact with the target group
(such as staff in tertiary institutions or employers) can similarly be presumed
likely to have some idea of the language demands which will be made on the
testee and thus to be able to provide usable information for test developers.

But in addition to teachers and test users, it could also be argued that
testees/learners themselves are "experts" on matters relating to their own
language use and that their perceptions should also be considered in drawing up
test criteria and specifications. Self-assessment based on learner-generated
criteria is becoming increasingly common practice in classroom-based formative
assessment and quite high correlations have been found between self-assessment
and other external measures (Oskarsson 1989). However, learner perspectives
have only recently begun to figure in proficiency test development (LeBlanc and
Painchaud 1985; Bachman and Palmer 1988).

So-called "naive" native speakers constitute another "expert" audience whose
perceptions could profitably be drawn on in establishing performance criteria. As
Barnwell (1987) forcefully argues:

the domain of proficiency is outside the classroom not inside. We can
(perhaps) leave achievement testing to the teachers and professional testers,
but once we aspire to measure proficiency it becomes a question of vox populi,
vox dei.

Language is central to our humanity, and it is the most democratic and
egalitarian attribute we share with our fellow man. Why then should we need
'experts' to tell us how well we speak? Thus it is not just an interesting novelty
to contemplate the use of 'naive' natives in proficiency testing and rating, it is
a logical necessity which arises out of the nature of the thing we are trying to
measure.

Given that it is native speaker judgements of proficiency which may well
determine the future of testees, it is clearly important to investigate on what basis
these judgements are made. As Clark and Lett (1988: 59) point out, comparing
native speaker judgements with proficiency descriptors is one way of validating
the descriptors in a non-circular way and of establishing the external criteria
which have been lacking up to the present.

Data collection is resource-intensive

In order to establish valid performance criteria, an analysis of the testees'
future domain of language use is clearly desirable. However, the collection of
data for test construction purposes poses a number of logistical difficulties.
From a practical point of view, the investigation of communicative needs is
extremely resource-intensive, to such an extent that the practical constraints of
data-gathering may end up jeopardizing the purpose for which the data are being
gathered. (This same point had been made in relation to the rigorous needs
assessment procedures which accompanied "target situation analysis" in ESP
course development.) An example is provided in a study by Stansfield and Powers
(1989) aimed at validating the Test of Spoken English as a tool for the selection
and certification of non-native health professionals and at establishing minimum
standards of proficiency. They state:

of necessity we asked for relatively global ratings, even for professionals and
chose situations that would be representative and typical of those in which
each professional might be involved. No attempt was made to specify all
the many situations that might be encountered, nor was any effort made to
designate highly specific tasks. We might have asked about the degree of
speaking proficiency needed in the performance of surgical procedures, for
example (in which oral proficiency might be critical) but time limitations
precluded such detail. In addition in this study, we decided to consider
neither other important dimensions of communicative competence (eg.
interpersonal skills and other affective components) nor functions of
language (eg. persuading or developing rapport with patients) that might be
highly desirable in various medical situations.

In only considering global proficiency, a course of action they were forced
to take through lack of necessary resources, the researchers neglected the
information which would be considered most essential by some (prospective
patients is one group which springs to mind!) for test validity.



Precise information is difficult to elicit

An additional problem in consulting test users or "naive" native speakers in
drawing up criteria for assessment is the difficulty of getting them to be
sufficiently precise about situations of language use to provide usable
information. Powers (1986), reporting on his attempts to elicit information from
faculty members on university students' listening patterns, observes that:

the notion of analysing listening activities may have been "foreign" to many
faculty members who were not involved intensely in language instruction or
testing. In particular, such concepts as "discourse cues" and "non-verbal
signals" may be somewhat far afield for faculty in non-language disciplines.
Moreover, while the rating of such passive, non-observable skills as listening
may be difficult generally, non-language oriented faculty may have even
greater difficulty in determining when students encounter specific kinds of
problems.

Native speakers are not language analysts. Nor are most learners. It is
hardly surprising, therefore, that the test users' perceptions of language needs
tend to be stated in rather vague terms. This is exemplified by an examination
by Brindley, Neeson and Woods (1989) of the language-related comments of 63
university supervisors' monitoring reports on the progress of foreign students.
They found that the vast majority of the comments were of the general kind
("has problems with writing English"; "English expression not good"), though a
few lecturers were able to identify particular goal-related skills ("has difficulty
following lecturers - speak very fast").

In a similar vein, Weir (1988: 73), commenting on the development of a
test specification framework for the TEEP test of English for academic purposes,
notes that

There is a need for more precise methods for dealing with task dimensions
than the pragmatic ones used in our research. We relied heavily on the
judgements of teachers and other experts in the field, as well as on the results
of small test administrations, to guide us on the appropriateness of task
dimensions in the various constructs. Unless finer instruments are developed
than these rather coarse subjective estimates, it is difficult to see how fully
parallel versions of the test can ever be developed.



Expert judgement may be unreliable 

If expert opinion is to have any currency as a method of developing criteria,
then one would expect that a given group of expert judges would concur, first on
the criteria which make up the behavioural domain being assessed and second,
on the allocation of particular performance features to particular levels.
(Obtaining data in this way would be an integral part of construct validation).
One would also expect that the group would be able to agree on the extent to
which a test item was testing a particular skill and the level of difficulty
represented by the item (agreement would constitute evidence for content
validity).

Studies aimed at investigating how expert judgements are made, however, cast
some doubt on the ability of expert judges to agree on any of these issues. Alderson
(1988), for example, in an examination of item content in EFL reading tests, found
that judges were unable to agree not only on what particular items were testing but
also on the level of difficulty of items or skills and the assignment of these to a
particular level. Devenney (1989), who investigated the evaluative judgements of
ESL teachers and students of ESL compositions, found both within-group and
between-group differences in the criteria which were used. He comments:

Implicit in the notion of interpretive communities are these assumptions: (1)
a clear set of shared evaluative criteria exists, and (2) it will be used by
members of the interpretive community to respond to text. Yet this did not
prove to be the case for either ESL teachers or students.

Different people use different criteria

Non-teacher native speakers, teachers and learners themselves, by virtue of 
their different backgrounds, experiences and expectations, have different 
understandings of the nature of language learning and communication. As a 
result, they tend to use different criteria to judge language ability and thus to pay 
attention to different features of second language performance. Studies of error
gravity, for example, have shown that native speakers tend to be less concerned
with grammatical accuracy than teachers (particularly those who are not native
speakers of the language taught (Davies 1983)). This highlights the difficulties of
constructing assessment criteria and descriptors which can be consistently 
interpreted by different audiences. 

It is interesting, and perhaps significant, to note in the context of this 
discussion that disciplines outside applied linguistics interpret "communication" 
or "communicative competence" quite differently and hence employ different 
criteria for assessment. Communication theorists, for example, accentuate 
criteria such as empathy, behavioural flexibility and interaction management 
(Wiemann and Backlund 1980) and emphasise the role of non-verbal aspects of
communication. In other fields, such as organisational management,
communicative ability is seen very much in terms of "getting the job done" and
the success of communication is thus judged primarily in relation to how well the
outcomes are achieved rather than on specific linguistic features (Brindley 1989:
122-23). McNamara (1987: 32) makes this point in relation to doctor-patient
communication, noting that in the medical profession "there is a concern for the
communication process in terms of its outcomes". He comments (1987: 47) that
"linguistic approaches to 'communicative ability' are indeed narrow, and
narrowly concerned with language rather than communicative behaviour as a
whole".

Two conclusions can be drawn from these observations. First, as
McNamara (op. cit.) points out, we must be conscious of the limitations of the
claims which can be made about the capacity of language tests to predict
communicative ability (in the broader sense) in real-life settings. Second, if
real-life judgements of communicative effectiveness are based on perceptions of
people's ability to use language to complete a task satisfactorily, then it is worth
trying to build this notion into assessment criteria. In this regard, the use of
"task fulfilment" as a criterion in the IELTS writing assessment scales is a
promising step (Westaway 1988).



Teachers will be teachers

Although teachers' judgements are frequently used as a basis for 
establishing assessment criteria, there is some evidence to suggest that the 
influence of their background and experience may be sufficiently strong to
override the criteria that are given. For example, in a preliminary analysis of 12
videotaped moderation sessions of oral interviews conducted for the purposes of
rating speaking ability at class placement in the Australian Adult Migrant
Education Program, I have found a consistent tendency for teachers to:

refer to criteria which are not contained in the performance descriptors at 
all, such as confidence, motivation, risk-taking capacity and learning 
potential. 

concentrate heavily on the assessment of some features of performance at 
the expense of others. In this case, more time was spent discussing the role
of grammatical accuracy than any other single factor, even though the
descriptions being used did not provide detailed or specific comments on 
grammatical features. 

use diagnostically-oriented and judgemental "teacher language" in applying
the criteria, such as:

She seemed to be weak on tenses

I was a bit concerned about her word order generally

her language was letting her down

got weak tense forms, not sure of her prepositions and quite often leaves
off a final 's'

Caulley et al (1988) report on a similar phenomenon in the context of the
evaluation of common assessment tasks used in the Victorian senior secondary
English examination:

in their discussions the teachers rarely or never referred to the specified criteria.
Their assessments were largely global, the language abstract and rarely
substantiated by reference to anything concrete.

This was exemplified by comments such as

he's got communicative sense 
he's more sure of his material 
there's a lack of flow 
she hasn't crystallised her ideas

They note that

teachers are involved with the growth and development of human beings 
through practice and in the end were shown to be neither willing nor able to 
divorce the performance of an action from those aspects of it such as 
intention, effort and risk, which make it one performed by a growing and
developing human being. They thus included in their assessment of students
an estimate of the risk involved for the particular student to present as he or
she did and something for the effort (or lack of effort) made in the 
preparation, although neither is mentioned in the guidelines. 

Although such non-linguistic factors do not conventionally figure as criteria
in definitions of proficiency, it would appear that they are included by teachers,
perhaps because they are perceived as part of their educator's role. Specific
assessment criteria may be developed rigorously and clearly spelled out, yet the
teachers appear to be operating with their own constructs and applying their own
criteria in spite of (or in addition to) those which they are given. This tendency
may be quite widespread and seems to be acknowledged by Clark and Grognet
(1985) in the following comment on the external validity of the Basic English
Skills Test for non-English-speaking refugees in the USA:

On the assumption that the proficiency-rating criterion is probably somewhat 
unreliable in its own right, as well as based to some extent on factors not 
directly associated with language proficiency per se (for example, student 
personality, diligence in completing assignments etc) even higher validity 
coefficients might be shown using external criteria more directly and accurately
reflecting language proficiency.

Further support for the contention that teachers operate with their own criteria 
is provided by a study carried out by Griffin (1989) who examined the 
consistency of the rating of IELTS writing scripts over time using a Rasch Rating
Scale model. An analysis of rater statistics revealed that

For assessment 1, most raters appeared to 'fit' the underlying variable. On
occasion 2, however, few raters appeared to fit the variable. There appears to
have been a change in the criteria or in the nature of the variable being used to
assign scripts to levels. The original criteria used in the familiarisation
workshop and reinforced in the training workshop do not seem to have been
used for assessment 2. Unfortunately it was assumed that the criteria would
remain the same and were in fact supplied to the raters.

(Griffin 1989: 10)

He comments that

raters seem to be influenced by their teaching background and the nature of
the criteria used can differ from rater to rater. Consensus moderation
procedures appear to have controlled this effect to some degree but not
completely.

(Griffin 1989: 13)



CONCLUSION

From this review of CRA, it should be clear, as Skehan (1984: 210)
remarks, that "criterion-referencing is an attractive ideal, but extremely difficult
to achieve in practice". As we have seen, the criteria which are currently used
may not reflect what is known about the nature of language learning and use and
may not be consistently interpreted and applied even by expert judges.
If the ideal of CRA is to be attained, it is necessary to develop criteria and
descriptors which not only reflect current theories of language learning and
language use but which also attempt to embody multiple perspectives on
communicative ability. As far as the first of these requirements is concerned,
Bachman and his colleagues have put forward a research agenda to develop
operational definitions of constructs in Bachman's model of communicative
language proficiency and validate these through an extensive program of test
development and research (see, for example, Bachman and Clark 1987;
Bachman et al 1988; Bachman 1990). One of the main virtues of this model, as
Skehan (1990) points out, is that it provides a framework within which language
testing research can be organised. It is to be hoped that the model will enable
language testers to systematically investigate the components of language ability
as manifested in tests and that the results of such research will be used to inform
the specifications on which assessment instruments are based.

Second language acquisition (SLA) research can also make a contribution
to the development of empirically-derived criteria for language assessment which
reflect the inherent variability and intersubjectivity of language use. First,
research into task variability of the type reported in Tarone (1989), Tarone and
Yule (1989) and Gass et al (1989a; 1989b) provides valuable insights into the
role that variables such as interlocutor, topic, social status and discourse domain
might exercise on proficiency. Investigation of factors affecting task difficulty
might also provide a more principled basis for assigning tasks to levels, a major
problem in CRA. A number of testable hypotheses are outlined by Nunan
(1989).

Second, SLA research could also provide much-needed information on the
factors which influence native speaker perceptions of non-native speakers'
proficiency. There is already a considerable literature on the overall
communicative effect of non-native speaker communication (eg Albrechtsen et
al 1980; Ludwig 1982; Eisenstein 1983) and error gravity (eg James 1977;
Chastain 1980; Davies 1983). However such studies have tended to examine the
effects of particular discourse, phonological, syntactic or lexical features on
comprehensibility and/or irritation, rather than relating them to perceptions of
proficiency. Studies conducted with a specific focus on proficiency would assist
in the creation of performance criteria which reflect those used in real life.
Information of this kind is of critical importance since in many cases, it is the
judgements of native speakers that will determine the future of language
learners, not so much those of teachers. At the same time, it is important to try
to establish to what extent non-linguistic factors such as personality, social status,
ethnicity, gender etc affect judgements of proficiency and the extent to which
these factors can be related to linguistic ones (Clark and Lett 1987).

Third, research into the nature of developmental sequences in learner
language gives an indication of the grammatical elements of language which can
reasonably be expected for production at different stages and thus provides a
basis for establishing assessment criteria which are consistent with the
regularities of language development (Pienemann et al 1988). In addition, since
the multi-dimensional model of second language acquisition described by
Pienemann and Johnston (1987) makes strong predictions concerning the
processing demands made by different linguistic elements on learners, it should
be possible to incorporate these predictions into concrete hypotheses concerning
task difficulty which can be empirically investigated.

Thus far I have sketched out the kinds of research that might contribute to
the development of better criteria. As far as the interpretation of the criteria is
concerned, however, it would be naive to imagine that different judges will not
continue to interpret criteria idiosyncratically. As Messick (1989) says:

...expert judgement is fallible and may imperfectly apprehend domain
structure or inadequately represent test structure or both.

Agreement between testers can be improved by familiarisation and training
sessions for raters, as Griffin (1989) reports. But there is always the
possibility that agreement might conceal fundamental differences. As Barnwell
(1985) comments:

raters who agree on the level at which a candidate can be placed may offer
very different reasons for their decisions

Given, as we have seen, that different judges may operate with their own
personalized constructs irrespective of the criteria they are given, it would be a
mistake to assume that high inter-rater reliability constitutes evidence of the
construct validity of the scales or performance descriptors that are used. In
order to provide such evidence, empirically-based investigation of the
behavioural domain itself has to be carried out, as I have indicated above. At the
same time, studies requiring teachers, learners and native speakers to
externalize the criteria they (perhaps unconsciously) use to judge language
ability would help to throw some light on how judgements are actually made by a
variety of different audiences and lead to a better understanding of the
constructs that inform the criteria they use. The procedures used in the
development of the IELTS band scales as reported by Westaway (1988),
Alderson (1989) and Griffin (1989) offer the possibility of building up a useful
database in this area.
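
The distinction between agreement on levels and agreement on criteria can be illustrated with a small numerical sketch. The Python fragment below is not drawn from any of the studies cited; the ratings, the criteria labels and the function names are all hypothetical, and Cohen's kappa and a simple set-overlap (Jaccard) index are used here only as convenient stand-ins for "inter-rater reliability" and "shared criteria". It shows two raters who place six candidates at identical levels while citing entirely different reasons.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' categorical ratings of the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(a)
    # Proportion of items on which the two raters assign the same level.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same category throughout
    return (observed - expected) / (1 - expected)


def criteria_overlap(a, b):
    """Jaccard overlap between the sets of criteria two raters cite."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Hypothetical data: levels assigned to six candidates by two raters,
# and the criteria each rater mentioned when justifying the ratings.
levels_rater_1 = [1, 2, 2, 3, 1, 3]
levels_rater_2 = [1, 2, 2, 3, 1, 3]
criteria_rater_1 = {"grammatical accuracy", "tense forms", "word order"}
criteria_rater_2 = {"confidence", "risk-taking", "task fulfilment"}

kappa = cohen_kappa(levels_rater_1, levels_rater_2)
overlap = criteria_overlap(criteria_rater_1, criteria_rater_2)
print(f"kappa on levels: {kappa:.2f}")
print(f"overlap of cited criteria: {overlap:.2f}")
```

With these invented figures the agreement statistic is at its maximum while the overlap in cited criteria is zero, which is the situation Barnwell describes: the reliability coefficient alone says nothing about whether the raters are judging the same construct.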

Finally, in the context of classroom CRA, the time is ripe to explore the
feasibility of incorporating communicatively-oriented CRA into the teaching and
learning process. In the field of general education, the results of research into
the development of CR instruments for classroom use indicate that the
problems of domain specification described in this paper may not be as
intractable as they are sometimes portrayed (Black and Dockrell 1984).
Numerous CR schemes for formative assessment and profiling are in existence
in general education in the United Kingdom and Australia (see Brindley 1989 for
an overview) and appear to be quite adaptable to second language learning
situations. The use of CR methods of assessing achievement based on
communicative criteria would not only help to link teaching more closely to
assessment, but also would allow for closer involvement of learners in
monitoring and assessing their progress.